-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for disabling the sorting and list creation for WordNet object relation methods #3240
base: develop
Are you sure you want to change the base?
Conversation
…bject relation methods Add support for disabling the sorting and list creation for WordNet object relation methods. It keeps backward compatibility. See nltk#3193.
@bryant1410, how big a speed improvement do you typically observe on your use case, when setting sort and force_list to False? |
Hey, I did a quick test based on one of the functions I use. The setup is: pip install git+https://github.com/bryant1410/nltk@patch-5 def _get_sibling_or_cousin_co_hyponym_lemmas(synset):
for parent in synset.hypernyms():
for grandparent in parent.hypernyms():
for uncle_or_parent in grandparent.hyponyms():
for cousin_or_sibling in uncle_or_parent.hyponyms():
if cousin_or_sibling != synset:
yield from cousin_or_sibling.lemmas()
def _get_sibling_or_cousin_co_hyponym_lemmas_no_sort(synset):
for parent in synset.hypernyms(sort=False, force_list=False):
for grandparent in parent.hypernyms(sort=False, force_list=False):
for uncle_or_parent in grandparent.hyponyms(sort=False, force_list=False):
for cousin_or_sibling in uncle_or_parent.hyponyms(sort=False, force_list=False):
if cousin_or_sibling != synset:
yield from cousin_or_sibling.lemmas()
import timeit
from nltk.corpus import wordnet as wn
synset = wn.synset("man.n.01")
# We warm it up before benchmarking:
list(_get_sibling_or_cousin_co_hyponym_lemmas(synset)) With default values: >>> timeit.timeit(lambda: list(_get_sibling_or_cousin_co_hyponym_lemmas(synset)), number=100)
0.30820278730243444 With disabling sorting and list creations: >>> timeit.timeit(lambda: list(_get_sibling_or_cousin_co_hyponym_lemmas_no_sort(synset)), number=100)
0.20816301740705967 About 50% faster for this example. More examples (in milliseconds):
In my real use case example, I call this function multiple times per data loader worker (which, in turn, I have one per CPU core). It's part of a code that bottlenecks (starves) the GPU usage, as I have to compute thousands of these per second per data loader worker. |
There are two problems with the above test script. |
Note that the list conversion occurs to the final output, not the intermediate ones. The list creations that are avoided are each of the outputs to the Still, we can do a test that avoids the final Having said this, I later observed that using
Thanks for catching this! For my use case, it's fine, and I believe it's preferred. |
I also observe slightly better timings with force_list=True. The difference is small but consistent, so this additional parameter does not contribute to solving the speed bottleneck from #3193. Still, it may be useful for uses where one wants to consume the output one piece at a time instead of waiting for the whole list. |
Testing with more inputs, the speedup from force_list= False stays small, but it displays some randomness, and it is not always positive or negative. It may be questionable whether it is worth the cost of managing this additional parameter:
sort gain is the percentage speedup of setting sort=False, w.r.t. the baseline, and list gain is the additional speedup of also setting force_list=False.
In these tests, the functions to retrieve second degree cousins use sets, in order to avoid duplicates. This is much more effective (and in itself much quicker than the sort=False optimization) when the ratio of duplicates over unique cousins is high, as in the 'man.n.01' example, where it is 4430 over 2085. But when the ratio is low, as with 'dog.n.01' (212 over 204), using sets actually incurs a 20% runtime penalty. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if relation_symbol not in self._pointers:
return []
With force_list= False, the return type of _related() becomes unpredictable, because it is an iterator when the relation_symbol is present in self._pointers, but an empty list otherwise, in which case calling next() on the output fails with an Error.
It's an iterable regardless.
…On Sat, Mar 30, 2024, 5:19 AM Eric Kafe ***@***.***> wrote:
***@***.**** requested changes on this pull request.
if relation_symbol not in self._pointers:
return []
With *force_list= False*, the return type of *_related()* becomes
unpredictable, because it is an iterator when the *relation_symbol* is
present in *self._pointers*, but an empty list otherwise, in which case
calling *next()* on the output fails with an Error.
—
Reply to this email directly, view it on GitHub
<#3240 (review)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA5ZPXJLUDVN25AA2XKSDQTY2ZYQZAVCNFSM6AAAAABEZVFBUGVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTSNRZHA3TCOJWGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
But it is not an iterator: next([]) fails with
while next(iter([])) returns
|
You can do `iter` regardless, then `next`. Or a `for` loop directly.
…On Sat, Mar 30, 2024, 5:44 AM Eric Kafe ***@***.***> wrote:
It's an iterable regardless.
But it is not an iterator: *next([])* fails with
TypeError: 'list' object is not an iterator
while *next(iter([]))* returns
StopIteration
—
Reply to this email directly, view it on GitHub
<#3240 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA5ZPXOAWCX5IDCLUQESZPDY2Z3NHAVCNFSM6AAAAABEZVFBUGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRXHE4TIMBZGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Wouldn't you prefer that the output of _related(force_list=False) had the same predictable type, no matter whether targets were found or not? |
For me personally no but I can see how other people would prefer that. If
it's always e.g. an iterable type (that has `__iter__`) then it's fine for
me. So I have that one guarantee. Which makes sense with something not
being forced; it's more relaxed and something is not guaranteed then (that
is not a list).
…On Sat, Mar 30, 2024, 6:22 AM Eric Kafe ***@***.***> wrote:
Wouldn't you prefer that the output of _*related(force_list=False)* had
the same predictable type, no matter whether targets were found or not?
—
Reply to this email directly, view it on GitHub
<#3240 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA5ZPXN2B5URPVJNH3BNQPDY2Z76DAVCNFSM6AAAAABEZVFBUGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRYGAYDIMRXGU>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Add support for disabling the sorting and list creation for WordNet object relation methods. It keeps backward compatibility.
See #3193.